Optimal Sequential Designs for On-Line Item Estimation
Executive  Summary 

The workhorses of modern test theory are so-called item
characteristic curves (ICC's); these are mathematical functions which
describe how the probability of correctly answering a test question
changes with ability. In computerized adaptive testing (CAT), these ICC's
are used both to select appropriate problems for an examinee and to score
the examinee's performance on the selected problems.

The  ICC's  for  a  test's  problems  are  not  known,  but  must  be 
estimated  from  data.  Typically,  such  data  are  collected  before  the  test  is 
used  operationally  in  what  is  known  as  a  "calibration  study".  In  the  case  of 
CAT-ASVAB,  calibration  data  were  collected  by  administering  subsets  of 
the  questions  via  paper-and-pencil  test  booklets  to  50,000  applicants  for 
military  service  in  Military  Entrance  Processing  Stations  (MEPS). 

Such off-line calibration studies have several shortcomings. First,
they are expensive to conduct in that they make heavy demands on an
overburdened MEPS command and sometimes prevent same-day
processing, necessitating that applicants be billeted in hotels. Second, the
performance data are suspect, since examinees are told that their
performance on the non-operational problems will not "count". Third, the
process is inefficient in that the random sample of examinees given a
particular test question is usually not the optimal sample for estimating that
question's ICC.

It is widely held that the answer to the shortcomings of "off-line"
calibration is "on-line" calibration. In on-line calibration one gathers the
needed data to estimate ICC's by unobtrusively seeding a small number of
non-operational items into an applicant's operational CAT test. If the
number of additional items given to each applicant is small, data collection
is virtually cost free. If an applicant cannot distinguish non-operational
items from operational items, the performance data will better reflect his/her
capabilities. And, if the non-operational items are embedded in an
operational test which is administered via computer, in principle one should
not be limited to collecting data from random samples, but could employ
some optimal sample design strategy. This work seeks to develop the
wherewithal for dynamically constructing optimal samples.

One can think of the optimal sample design problem in the following
terms. For concreteness suppose that we have 500 new test problems to
be calibrated and that each day 1000 applicants are tested in the MEPS.
Further suppose that we can tolerate at most 2 non-operational items
embedded in each applicant's operational test. Since each applicant can
be assigned 2 items from a set of 500 items, there are 500 choose 2, or
124,750, potential allocations for each applicant. Since there are 1000
applicants, the number of potential allocations overall is astronomical
(124,750 raised to the power 1000, or roughly 1.1E5096, where xEn means
x multiplied by 10 to the power n). The sample design problem is the
problem of allocating the available applicants to the non-operational items
in an optimal manner.
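The counting argument above can be checked with a short sketch. It assumes, as the paragraph does, that each applicant's pair of items is chosen independently of every other applicant's; the function name `allocation_counts` is hypothetical. The total is reported through its base-10 logarithm because the number itself has several thousand digits.

```python
import math


def allocation_counts(n_items=500, items_per_applicant=2, n_applicants=1000):
    """Return (allocations per applicant, mantissa, exponent) of the total.

    The total number of allocations is per_applicant ** n_applicants,
    expressed as mantissa * 10 ** exponent.
    """
    per_applicant = math.comb(n_items, items_per_applicant)
    # Work with logarithms: the full integer is far too large to print usefully.
    log10_total = n_applicants * math.log10(per_applicant)
    exponent = math.floor(log10_total)
    mantissa = 10 ** (log10_total - exponent)
    return per_applicant, mantissa, exponent


per_applicant, mantissa, exponent = allocation_counts()
print(per_applicant)
print(f"{mantissa:.2f}E{exponent}")
```

Even this modest scenario rules out exhaustive enumeration, which is why a structured search over allocations is needed.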

If one is to improve upon random allocation, one needs three
elements: (a) a relevant basis on which to distinguish the objects to be
allocated (in this case the applicants), (b) an objective function which
orders the set of potential allocations with respect to some measure of
quality, and (c) an efficient algorithm for "searching" the rather large space
of potential allocations.

This research developed all three elements. (a) Since examinee
abilities are not known, the applicants were distinguished by the
maximum-likelihood estimates of their ability from their operational CAT
test. (b) The objective function was based on the determinant of the Fisher
information matrix for the parameters of the ICC. And, (c) a branch and
bound algorithm was used to search the space of potential allocations for
the set which is optimal.

The need for several approximations accompanies adoption of an
objective function based on the Fisher information matrix. First, to compute
the Fisher information matrix, one must know the ICC. This requirement
was circumvented by employing a sequential optimization strategy in which
initial ICC estimates were iteratively refined as more and more data were
gathered. The method developed for updating ICC estimates involved
modeling the measurement error of the CAT ability estimate and using this
model to modify the maximum likelihood estimate of the item parameters.
Without this modification the usual MLE is biased.

Optimal allocation via Fisher information also requires that abilities
be  known.  In  this  case  maximum-likelihood  estimates  of  ability  were  used. 
The  maximum-likelihood  estimates  are  based  on  the  data  from  the 
operational  portion  of  the  CAT. 

Monte Carlo data suggest that the consequences of these
approximations were not severe. Even without knowledge of the actual
ICC's or abilities, the approach proved to be at least 90% as efficient as the
theoretically optimal design in all cases studied. Designs based on random
allocation of applicants to items were less than 30% as efficient as the
sequential design algorithm developed here.

Optimal Sequential Designs for On-Line Item Estimation

Abstract 

Replenishing item pools for on-line ability testing requires innovative
and efficient data collection designs. Based on a theoretical framework for
generating exact D-optimal designs for selecting individual examinees, and
for consistently estimating item parameters, this article presents a
sequential procedure for on-line item calibration. These procedures were
derived for general, dichotomous item response models, using Welch
(1982) for exact N-point D-optimal designs and Stefanski and Carroll
(1985) for consistent estimators. In simulations, these designs appear to
be considerably more efficient than random seeding of items. Key words:
Branch-and-bound, Computerized adaptive test, Exact N-point D-optimal,
Integer programming, Item response theory, Measurement error model,
On-line testing, Sequential design.

Introduction

Calibration of new items is an essential part of a testing system,
because operational items eventually become overexposed and need
replacement. To calibrate new items for the Armed Services Vocational
Aptitude Battery (ASVAB), costly testing sessions are conducted where all
the new items are presented to examinees who have been recruited
expressly for the purpose (C. E. Davis, personal communication, March 15,
1991). The obtained data may be unreliable because the examinees know
that the test "doesn't count" and, thus, do not do their best. On-line item
calibration promises to yield more reliable data on new items at virtually no
cost. This research is concerned with the development of item calibration
procedures that take advantage of the auxiliary ability estimates supplied
by the on-line test. This will enable the procedure to select pre-specified
ability distributions known to yield high information regarding a given item.

Researchers have recently focussed on the effect of an ability
distribution on the precision of the estimate of an item parameter.
Wingersky and Lord (1984) showed that when item and ability parameters
are estimated simultaneously, a rectangular distribution of ability, instead of
a normal distribution, reduces the standard errors of all parameters.
Studying the standard errors of the estimates of the item parameters only,
Stocking (1990) concluded that a broad distribution of ability, uniform or
bimodal, was better than a bell-shaped distribution. In addition, she
convincingly argued that even in very large samples, very little information
may be available for calibrating some items, and that the success of a
particular item calibration using item response theory depends heavily on
the selection of more informative data.

The theory of optimal design is concerned with planning data
collection so that the data will be as informative as possible. The general
theory of optimal design focuses on "minimizing" the variance-covariance
matrix of the parameter estimates, or, equivalently, on "maximizing" its
inverse, the Fisher information matrix. A D-optimal design maximizes the
determinant of the Fisher information matrix, and an A-optimal design
minimizes the trace of the inverse of the Fisher information matrix
(Federov, 1972; Silvey, 1980). The criteria used in Wingersky and Lord
(1984) and Stocking (1990) are related to the theory of A-optimality.

Ford (1976) derives D-optimal designs for logistic regression
functions, also known as two-parameter logistic item response models
(Lord, 1980). He shows that discrete, two-point distributions are D-optimal,
with support points depending on the values of the item parameters.
However, optimal designs for the two-parameter logistic functions are
unstable. Silvey (1980) gives an example showing that Ford's design,
optimal for one item, may be extremely suboptimal for another item, even if
its parameters are close by. Silvey concludes that it is not practical to use a
design that is maximin over a subset of candidate values for the unknown
parameters. A design is maximin if it maximizes the minimum possible
determinant of the Fisher information matrix. To overcome this problem,
Ford and Silvey (1980) study sequentially constructed designs for
two-parameter logistic models, which employ D-optimal "subdesigns" based
on current estimates of the parameters. This research will show that
sequentially constructed designs are useful for on-line item calibration.

Ford and Silvey's sequentially constructed designs apply large
sample optimal design theory to small sample subdesigns. The
approximation of small exact designs with large sample designs is
sometimes inadequate (Welch, 1982). Recently, researchers have
developed some theory and procedures for finding optimal designs for
small samples, called optimal exact N-point designs. For linear regression
models, Welch (1982) investigated a branch-and-bound algorithm for
finding D-optimal exact N-point designs with support over a discrete design
space. Also for linear regression models, Haines (1987) investigated a
simulated annealing algorithm for N-point designs with support over a
continuous design space. Additional algorithms are presented in Donev
and Atkinson (1988) and Mitchell (1974). This research will find exact
N-point designs for constructing on-line calibration samples.

In contrast to logistic regression problems, estimation of item
response models must use an estimate of the covariate instead of a true
value of the covariate. In our context, the covariate is an estimate of the
examinee's ability generated from the operational part of the on-line test.
This notion is in accordance with Stocking (1990), who notes that "a
sample could possibly be selected based on some available observed
auxiliary information." Ford and Silvey (1980) did not solve the problem of
measurement errors in the covariate when they did their study. Earlier,
Federov (1972) considered the development of designs in the presence of
measurement errors for the general linear model. Independently of
researchers in the field of optimal design for logistic models, Stefanski and
Carroll (1985) study the effect of measurement errors in the covariate on
the asymptotic bias of the MLE of parameters of the logistic regression
model. They showed that the MLE is asymptotically biased; in other
words, it is not consistent.

Using the ability estimate for each examinee, generated by the
operational part of the on-line test, this research explores sequential
D-optimal designs that are appropriate for three-parameter and other item
response models. In addition, this study presents an estimator of item
parameters using estimates of ability in place of true ability. This research
studies the relative efficiency of the normal and uniform designs. Because
the designs are compared according to the D-optimal criterion, this
research will add to the base of research that has looked at individual
standard errors of estimates.

Elements  of  Item  Response  Theory 
Item  Response  Functions 

Let $u_i$ denote a response to a single item from individual $i$ with ability
level $x_i$, possibly multivariate. Assume that all response variables are
dichotomously scored either correct, $u_i = 1$, or incorrect, $u_i = 0$. An item
response function is a function of $x_i$, and describes the probability of
correct response of an individual with ability level $x_i$ when presented with
the item. Let $\beta = (\beta_0, \ldots, \beta_{M-1})^T$ denote the vector of $M$ parameters
associated with the item. The probability of a correct response follows the
form $P(x_i; \beta)$. The mean and variance of the parametric family are:

$$E\{u_i \mid \beta\} = P(x_i; \beta), \text{ and} \tag{1}$$

$$\sigma^2(x_i; \beta) = \mathrm{Var}\{u_i \mid \beta\} = P(x_i; \beta)\,[1 - P(x_i; \beta)]. \tag{2}$$

An example of a family of item response functions is the celebrated
family of three-parameter logistic response functions:

$$P(x_i; \beta) = \beta_2 + (1 - \beta_2)\,R(\beta_0 + \beta_1 x_i), \tag{3}$$

where $R(z)$ is the logistic function:

$$R(z) = \frac{e^z}{1 + e^z}. \tag{4}$$

Another functional form for $R(z)$ is the normal ogive; see Lord (1980) for
references.
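Equations (3) and (4) translate directly into code. The following is a minimal sketch for a unidimensional ability; the function names `logistic` and `p3pl` and the parameter values in the usage lines are illustrative assumptions, not values taken from this study.

```python
import math


def logistic(z):
    """Logistic function R(z) = e^z / (1 + e^z), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def p3pl(x, beta):
    """Three-parameter logistic response function of equation (3).

    beta = (beta0, beta1, beta2): intercept, slope, and lower asymptote.
    """
    beta0, beta1, beta2 = beta
    return beta2 + (1.0 - beta2) * logistic(beta0 + beta1 * x)


# Hypothetical item with lower asymptote 0.2, evaluated at three ability levels.
for ability in (-2.0, 0.0, 2.0):
    print(ability, round(p3pl(ability, (0.0, 1.0, 0.2)), 4))
```

Setting `beta2 = 0` recovers the two-parameter logistic model discussed by Ford (1976).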

Throughout the remainder of this article, we will assume that the
response variables are binary and are statistically independent given the
ability level. The optimal design results of this paper can be extended to
response functions other than the three-parameter family. All that is
necessary is that the item response models must be differentiable in the
item parameters.


The log-likelihood function of $\beta$ based on independent observations
$(u_i, x_i;\ i = 1, 2, \ldots, N)$ is

$$L_N(\beta; u_1, \ldots, u_N, x_1, \ldots, x_N) = \sum_{i=1}^{N} \{u_i \log P(x_i; \beta) + (1 - u_i) \log[1 - P(x_i; \beta)]\}. \tag{5}$$

Under regularity conditions, the MLE is asymptotically normal with
mean $\beta$ and variance-covariance matrix $[M(\beta)]^{-1}$, where $M(\beta)$ is the Fisher
information matrix with elements $m_{kl}(\beta)$ equal to:

$$m_{kl}(x; \beta) = \sum_{i=1}^{N} \sigma^{-2}(x_i; \beta)\, \frac{\partial P(x_i; \beta)}{\partial \beta_k}\, \frac{\partial P(x_i; \beta)}{\partial \beta_l}. \tag{7}$$
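As a sketch of how equations (5) and (7) can be evaluated for the three-parameter logistic model, the code below uses the analytic partial derivatives of equation (3) with respect to the three item parameters. The function names are hypothetical, ability is assumed unidimensional, and no claim is made that this mirrors the implementation used in this research.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def p3pl(x, beta):
    beta0, beta1, beta2 = beta
    return beta2 + (1.0 - beta2) * logistic(beta0 + beta1 * x)


def log_likelihood(us, xs, beta):
    """Equation (5): Bernoulli log-likelihood of the item parameters."""
    total = 0.0
    for u, x in zip(us, xs):
        p = p3pl(x, beta)
        total += u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total


def fisher_information_3pl(xs, beta):
    """Equation (7): 3 x 3 information matrix summed over ability levels xs."""
    beta0, beta1, beta2 = beta
    info = [[0.0] * 3 for _ in range(3)]
    for x in xs:
        r = logistic(beta0 + beta1 * x)
        p = beta2 + (1.0 - beta2) * r
        # Partial derivatives of P with respect to beta0, beta1, beta2.
        grad = [(1.0 - beta2) * r * (1.0 - r),
                (1.0 - beta2) * r * (1.0 - r) * x,
                1.0 - r]
        weight = 1.0 / (p * (1.0 - p))  # sigma^{-2}(x; beta), from equation (2)
        for k in range(3):
            for l in range(3):
                info[k][l] += weight * grad[k] * grad[l]
    return info


M = fisher_information_3pl([-1.0, 0.0, 1.0], (0.0, 1.0, 0.2))
for row in M:
    print([round(v, 4) for v in row])
```

The matrix is symmetric by construction, and each additional examinee adds a rank-one term, which is what makes the choice of ability levels a design problem.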


When the ability levels are observed with error, the MLE is an
asymptotically biased estimator of $\beta$, or equivalently, it is not consistent.
Stefanski and Carroll (1985) suggested several modifications of the basic
MLE in the logistic regression model that tend to reduce the bias. We base
our modification on one of their suggestions. We first describe a plausible
model for the observed ability level.


Measurement error in ability level. Let $x_i$ denote the true ability of an
examinee, and let $X_i$ denote the observed ability obtained in an on-line
testing system. Let $e_i$ denote the measurement error associated with the
on-line test. Then a measurement model is

$$X_i = x_i + e_i. \tag{8}$$

Assume $x_i$ and $e_i$ are independent and denote the variances of
$(X_i, x_i, e_i)$ by $(\sigma_{XX}, \sigma_{xx}, \sigma_{ee})$, so that

$$\sigma_{XX} = \sigma_{xx} + \sigma_{ee}. \tag{9}$$


The reliability ratio for this measurement error model is (Fuller, 1987):

$$\kappa_{xx} = \frac{\sigma_{xx}}{\sigma_{XX}}. \tag{10}$$

We will call this ratio the test reliability ratio. The traditional notion of test
reliability is associated with classical test theory. The definition of test
reliability is relative because the variance of test scores in the population
depends on the design we wish to achieve. For a fixed measurement error
variance $\sigma_{ee}$, a test may perform well for a design where $\sigma_{XX}$ is large.
However, the same test with the same measurement error may perform
poorly for a design where $\sigma_{XX}$ is small. This notion of relative performance
of test scores is well known by test practitioners; for example, an aptitude
test developed for a general population is usually unsuited for the purpose
of selection with a cut-off score. (This is because the measurement error of
the test is not small enough to distinguish between examinees with
aptitudes just to the right or left of the cut-off score.)

It is possible to control $\sigma_{ee}$ in an on-line testing environment
because $\sigma_{ee}$ decreases as the number of administered items increases and
as the on-line test administers items with high information. The reliability
ratio $\kappa_{xx}$ will be a tuning variable for the sequential design algorithm.
Values in the range 75% to 95% are meaningful. The quantity $\sigma_{XX}$
depends on the sample design and determines the required value of $\sigma_{ee}$
for a given value of the reliability ratio according to:

$$\sigma_{ee} = \sigma_{XX}(1 - \kappa_{xx}). \tag{11}$$

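MLE modified for measurement error. Assume that the error terms
$(e_i)$ in model (8) are independent and normally distributed with mean 0 and
variance $\sigma_{ee}$. Under this measurement error model the MLEs of $\beta$ and
$x = (x_1, x_2, \ldots, x_N)^T$ maximize

$$L_N(\beta, x) = \sum_{i=1}^{N} \left\{u_i \log P(x_i; \beta) + (1 - u_i)\log[1 - P(x_i; \beta)] - (2\sigma_{ee})^{-1}(X_i - x_i)^2\right\}. \tag{12}$$

The vectors $\hat{\beta}$, $(\hat{x}_i)$ maximizing expression (12) satisfy

$$\sum_{i=1}^{N} \frac{u_i - P(\hat{x}_i; \hat{\beta})}{\sigma^2(\hat{x}_i; \hat{\beta})}\, \frac{\partial P(\hat{x}_i; \hat{\beta})}{\partial \beta_k} = 0, \quad k = 0, \ldots, M-1; \tag{13}$$

$$\hat{x}_i = X_i + \sigma_{ee}\, \frac{u_i - P(\hat{x}_i; \hat{\beta})}{\sigma^2(\hat{x}_i; \hat{\beta})}\, \frac{\partial P(\hat{x}_i; \hat{\beta})}{\partial x}, \quad i = 1, \ldots, N. \tag{14}$$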
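To make equation (14) concrete, consider the two-parameter logistic special case ($\beta_2 = 0$), where $\partial P/\partial x = \beta_1 P(1 - P)$ and $\sigma^2 = P(1 - P)$, so that (14) reduces to $\hat{x}_i = X_i + \sigma_{ee}\beta_1[u_i - P(\hat{x}_i; \beta)]$. The sketch below solves this by simple fixed-point iteration while holding the item parameters at a current estimate. The function name, tolerance, and numerical values are assumptions for illustration; this is not the modified estimator of the text, only the unmodified stationarity condition.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_ability(X_obs, u, beta0, beta1, sigma_ee, tol=1e-10, max_iter=200):
    """Fixed-point solution of equation (14) for a two-parameter logistic item.

    X_obs    : observed ability from the operational test
    u        : scored response (0 or 1) to the item being calibrated
    sigma_ee : measurement error variance of X_obs
    The iteration contracts when sigma_ee * beta1**2 / 4 < 1.
    """
    x = X_obs
    for _ in range(max_iter):
        p = logistic(beta0 + beta1 * x)
        x_new = X_obs + sigma_ee * beta1 * (u - p)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# A correct response pulls the ability estimate up; an incorrect one pulls it down.
print(solve_ability(0.0, 1, 0.0, 1.0, 0.25))
print(solve_ability(0.0, 0, 0.0, 1.0, 0.25))
```

The size of the adjustment is governed by $\sigma_{ee}$: with an error-free ability estimate the solution collapses to the observed value.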


A modification of this set of equations enables an easy implementation.
This modification replaces $\hat{x}_i$ by


(15)


We will call the resulting estimator of $\beta$ the modified MLE. Stefanski
and Carroll showed that, in logistic regression, the modified MLE reduces
asymptotic bias for known $\sigma_{ee}$.


Effective sample size required for the modified MLE. Because the
data follow the Bernoulli probability model, the variance of estimators
based on maximum likelihood decreases approximately at the rate
$[NP(1 - P)]^{-1}$, where $P$ is the value of the response probability averaged
over the design values $x_1, x_2, \ldots, x_N$. Thus,

$$S = NP(1 - P) \tag{16}$$

is sometimes called the effective sample size. For example, for N = 300 and
600 and P = 0.10, the effective sample size is approximately 30 and 60,
respectively.
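Equation (16) is simple enough to check directly. In the sketch below, the function name `effective_sample_size` is hypothetical. For P = 0.10 the formula gives 27 and 54, which the text rounds to approximately 30 and 60.

```python
def effective_sample_size(n, p_mean):
    """Equation (16): S = N * P * (1 - P), with P the design-averaged
    probability of a correct response."""
    return n * p_mean * (1.0 - p_mean)


for n in (300, 600):
    print(n, effective_sample_size(n, 0.10))
```

The effective sample size is largest when P is near 0.5, so the same number of examinees carries more information for items of moderate difficulty.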

Stefanski and Carroll's Monte Carlo study showed that the
increased variability of the modified estimators outweighs their savings in
bias, so that they underperform the regular MLE for N = 300. However, for
N = 600, they showed that the modified estimator outperforms the regular
MLE. They chose P = 0.10 and a reliability ratio equal to 0.75 for their
Monte Carlo study. Because they were interested in the performance of
the estimator only, they did not investigate sequentially constructed
designs and allowed the design to be observational, resulting in a normally
distributed design.

Because we expect to see P fall in the range 0.4 to 0.6, and
because we will use D-optimal designs instead of normal designs, the
modified estimator should outperform the regular MLE for an effective
sample size of no more than 60. We shall see later that, with no more than
25 per cent of the observations, a D-optimal design yields the same amount
of information as does a normal design. According to the Stefanski-Carroll
study, this implies that a modified MLE should start to outperform the usual
MLE with an optimal design at about 0.25(60), or equivalently 15, effective
observations. For P in the range 0.4 to 0.6, 15 effective observations
translate into no more than N = 15/[(0.4)(0.6)] = 62.5 < 63 actual
observations.

To achieve the bias-reducing properties of a modified estimator, one
must control $\sigma_{ee}$ with equation (11). We will see that $\sigma_{XX}$ will range from
about 0.4 to 1 for D-optimal designs. By equation (11), this implies that
$\sigma_{ee}$ will range from 0.1 to 0.25 for a fixed $\kappa_{xx}$ equal to 0.75. This translates
into requesting the on-line test to supply estimates of ability levels with
standard errors of measurement in the range 0.31 to 0.50.
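The arithmetic behind these ranges follows from equation (11), $\sigma_{ee} = \sigma_{XX}(1 - \kappa_{xx})$. A minimal sketch, in which the function name `required_error_variance` is hypothetical and the inputs are the values quoted in the text:

```python
import math


def required_error_variance(sigma_XX, kappa_xx):
    """Equation (11): measurement error variance implied by a design
    variance sigma_XX and a target reliability ratio kappa_xx."""
    return sigma_XX * (1.0 - kappa_xx)


for sigma_XX in (0.4, 1.0):
    sigma_ee = required_error_variance(sigma_XX, 0.75)
    print(sigma_XX, sigma_ee, math.sqrt(sigma_ee))
```

The square roots are the standard errors of measurement that the operational test would need to deliver for each design variance.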

This is a reasonable request to make of an on-line test that has been
developed for a general population with normally distributed ability. To see
this, recall that the three-parameter logistic model associates with it a
population of latent ability levels, normally distributed with standard
deviation 1.7 (Lord, 1980). Assume the reliability of the on-line test is
$\kappa_{xx}$ = 0.97 relative to this normal population. Using (11), it is easy to
show that the variance of the measurement error is $\sigma_{ee}$ =
(1 - 0.97)(0.97)(1.7)^2 = 0.08409. Thus, if an on-line test has reliability
0.97 for a normal population with standard deviation 1.7, then the on-line
test would have been developed to supply estimates of ability with
standard errors of measurement equal to $\sqrt{0.08409}$ = 0.29. The value of
0.97 for the on-line reliability is the lowest value we recommend.


Optimal  Design  Theory  for  Item  Calibration 

Definitions 

N-point optimal designs. The sampled ability level $x$ will be
constrained to a subset $\mathcal{X}$ of $R^k$, where $k$ is the dimension of the ability.
$\mathcal{X}$ will be called the design space. An N-point design is a collection of
ability levels $x = (x_1, \ldots, x_N)$ from the design space $\mathcal{X}$. The expected value
of the $i$th response variable follows the response function with form
$P(x_i; \beta)$. For each $x_i$ denote the M-column vector of partial derivatives as
$\partial P(x_i; \beta)/\partial \beta$. The design problem is to choose $x$ with $x_i \in \mathcal{X}$,
$i = 1, \ldots, N$, to make

$$M(x; \beta) = \sum_{i=1}^{N} \sigma^{-2}(x_i; \beta)\, \frac{\partial P(x_i; \beta)}{\partial \beta} \left[\frac{\partial P(x_i; \beta)}{\partial \beta}\right]^T \tag{17}$$

as large as possible, where "large" is an appropriate attribute for a matrix.

Criteria for generating designs. There are several criteria for
generating designs; see Silvey (1980) for an exposition. We list three
criteria here. The criterion of D-optimality is the determinant of the
information matrix: $\det\{M(x; \beta)\}$. Its square root is inversely proportional to
the volume of the confidence ellipsoid for $\beta$. The criterion of A-optimality is
the trace of the inverse of the information matrix: $\mathrm{tr}(\{M(x; \beta)\}^{-1})$. It is
proportional to the average of the variances of the estimated parameters.
The criterion of strong optimality is the partial order on matrices induced
by the condition of non-negativity: if $x_1$ and $x_2$ are two N-point designs
then $x_1$ is better than $x_2$ in the strong sense if $M(x_1; \beta) - M(x_2; \beta)$ is
non-negative definite. Both the determinant and the trace functions belong
to a general class of criterion functions ($\phi$) which are necessary for strong
optimality: if $x_1$ and $x_2$ are two N-point designs such that $x_1$ is better than
$x_2$ in the strong sense, that is, such that $M(x_1; \beta) - M(x_2; \beta)$ is non-negative
definite, then $\phi\{M(x_1; \beta)\} \geq \phi\{M(x_2; \beta)\}$.
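The three criteria can be illustrated on small matrices. The sketch below works with symmetric 2 x 2 information matrices only; the helper names and the two example matrices are hypothetical. Because the A-criterion is a quantity to be minimized, a design that is better in the strong sense has a trace of the inverse that is no larger.

```python
def det2(m):
    """D-criterion for a 2 x 2 matrix: its determinant."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def trace_inverse2(m):
    """A-criterion for a 2 x 2 matrix: trace of the inverse."""
    return (m[0][0] + m[1][1]) / det2(m)


def strongly_better(m1, m2, tol=1e-12):
    """True if m1 - m2 is non-negative definite (symmetric 2 x 2 case)."""
    d = [[m1[i][j] - m2[i][j] for j in range(2)] for i in range(2)]
    return d[0][0] >= -tol and d[1][1] >= -tol and det2(d) >= -tol


# Two hypothetical information matrices; m1 dominates m2 in the strong sense.
m1 = [[2.0, 0.5], [0.5, 1.0]]
m2 = [[1.0, 0.5], [0.5, 0.5]]
print(strongly_better(m1, m2))
print(det2(m1), det2(m2))
print(trace_inverse2(m1), trace_inverse2(m2))
```

Strong ordering is only a partial order, so many pairs of designs are not comparable under it, which is why a scalar criterion such as the determinant is used in practice.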

We have introduced the criterion of A-optimality primarily because of
its obvious relation to the studies of Wingersky and Lord (1984) and
Stocking (1990). These studies indicate that, when using a random design
with random sampling, a rectangular distribution over ability level is better
than a normal one according to the criterion of A-optimality. We will
concentrate on the criterion of D-optimality.

The solution to the problem of finding an N-point D-optimal design
reduces to solving the mathematical programming model:

$$\text{Maximize } \det\{M(x; \beta)\} \quad \text{such that } x_i \in \mathcal{X},\ i = 1, \ldots, N. \tag{18}$$

This mathematical programming model has wide applicability to item
response theory. All that one needs is the Fisher information function (7)
for the targeted item response models. It can also be expanded to cover
designs for simultaneous item calibration.
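For a very small problem, model (18) can be solved by plain enumeration. The sketch below does this for a two-parameter logistic item over a five-point grid on [-1, +1]. It is a brute-force stand-in for illustration only, not the branch-and-bound search used in this research; the grid, the item parameters, and the function names are assumptions.

```python
import math
from itertools import combinations_with_replacement


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def det_info_2pl(design, beta0, beta1):
    """Determinant of the 2 x 2 information matrix of a 2PL item.

    For the 2PL model each ability level x contributes
    P(1 - P) * [[1, x], [x, x^2]] to the information matrix.
    """
    m00 = m01 = m11 = 0.0
    for x in design:
        p = logistic(beta0 + beta1 * x)
        w = p * (1.0 - p)
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return m00 * m11 - m01 * m01


def exhaustive_d_optimal(grid, n_points, beta0, beta1):
    """Enumerate every N-point design on the grid and keep the best one."""
    best_det, best_design = -1.0, None
    for design in combinations_with_replacement(grid, n_points):
        d = det_info_2pl(design, beta0, beta1)
        if d > best_det:
            best_det, best_design = d, design
    return best_design, best_det


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
design, det_value = exhaustive_d_optimal(grid, 4, beta0=0.0, beta1=1.0)
print(design, det_value)
```

With 5 grid points and N = 4 there are only 70 candidate designs, but the count grows combinatorially with the grid size and N, which motivates a search method that prunes the space.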

Approximate theory of D-optimal designs. Many theoretical
techniques are available in D-optimal design theory for the problem with
the criterion extended to probability distributions over the design space.
This is called the approximate theory of optimal design. We derive
approximate D-optimal designs for the purpose of comparing sequential
designs, normally distributed designs, and rectangular designs.

Let us extend the criterion to probability distributions over the
design space by first considering a finite design space. Denote points of
the design space by the distinct values $x_{(1)}, \ldots, x_{(p)}$. A design will replicate
these design sites $n_1, \ldots, n_p$ times, respectively. We associate with an
N-point design $x$ a probability distribution $\eta_N$ on $\mathcal{X}$, which puts probability
$p_j = n_j/N$ at $x_{(j)}$. Let $x$ be a random variable with distribution $\eta_N$ and
redefine the information matrix associated with the design as:

$$M(\eta_N; \beta) = E\left\{\sigma^{-2}(x; \beta)\, \frac{\partial P(x; \beta)}{\partial \beta} \left[\frac{\partial P(x; \beta)}{\partial \beta}\right]^T\right\} = \frac{1}{N}\sum_{i=1}^{N} \sigma^{-2}(x_i; \beta)\, \frac{\partial P(x_i; \beta)}{\partial \beta} \left[\frac{\partial P(x_i; \beta)}{\partial \beta}\right]^T = N^{-1} M(x; \beta). \tag{19}$$

Now it is straightforward to define the information matrix of an arbitrary
probability distribution over a design space, as follows. Assume the usual
probability space over $\mathcal{X}$. Let $\eta$ be a probability distribution and $x$ a random
ability level with distribution $\eta$, and define

$$M(\eta; \beta) = E\left\{\sigma^{-2}(x; \beta)\, \frac{\partial P(x; \beta)}{\partial \beta} \left[\frac{\partial P(x; \beta)}{\partial \beta}\right]^T\right\}. \tag{20}$$


Duality. Sibson (1972) showed that the approximate D-optimal
problem is dual to another mathematical programming model. Sibson's
result deals with the linear model. We extend this result to non-linear
models by defining the manifold in $R^M$ induced by the design space $\mathcal{X}$ as
follows:

$$\mathfrak{M} = \left\{z \in R^M : z = [\sigma(x; \beta)]^{-1}\, \frac{\partial P(x; \beta)}{\partial \beta},\ x \in \mathcal{X}\right\}. \tag{21}$$

The set $\mathfrak{M}$ depends on $\beta$. Sibson's result shows that the D-optimal problem
is dual to the problem of finding a minimal content ellipsoid contained in
$R^M$ centered at the origin that contains the manifold $\mathfrak{M}$. We present this
fact not because it leads to practical solution methods but because it leads
to deep theoretical insight about the nature of optimal designs.


Relative efficiency. To evaluate the efficacy of a given design we
introduce the notion of relative efficiency. Let $M^*(\beta)$ be the information
matrix of the approximate optimal design for a given value of the item
vector, $\beta$. Then the relative efficiency of a design, $\eta$, is defined as

$$\mathrm{eff}(\eta) = \frac{\det\{M(\eta; \beta)\}}{\det\{M^*(\beta)\}}. \tag{22}$$
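A small numerical sketch of equation (22) for a two-parameter logistic item follows. The reference design, an equal-weight two-point design at the ends of [-1, +1], is assumed here to play the role of the approximate D-optimal design for the hypothetical item with $\beta_0 = 0$ and $\beta_1 = 1$; it is not taken from Ford's formulas. The comparison design spreads equal weight over a five-point grid. All names and values are illustrative.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def det_info_per_observation(points, weights, beta0, beta1):
    """Determinant of M(eta; beta) for a discrete design eta on a 2PL item.

    points  : support points of the design
    weights : probabilities attached to the support points (summing to 1)
    """
    m00 = m01 = m11 = 0.0
    for x, prob in zip(points, weights):
        p = logistic(beta0 + beta1 * x)
        w = prob * p * (1.0 - p)
        m00 += w
        m01 += w * x
        m11 += w * x * x
    return m00 * m11 - m01 * m01


def relative_efficiency(design, reference, beta0, beta1):
    """Equation (22): ratio of determinants, design versus reference."""
    return (det_info_per_observation(*design, beta0, beta1)
            / det_info_per_observation(*reference, beta0, beta1))


two_point = ([-1.0, 1.0], [0.5, 0.5])  # assumed reference design
uniform_grid = ([-1.0, -0.5, 0.0, 0.5, 1.0], [0.2] * 5)
print(relative_efficiency(uniform_grid, two_point, 0.0, 1.0))
```

Under these assumptions the evenly spread design recovers only part of the reference determinant, illustrating how much information is lost by not concentrating examinees at the informative ability levels.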


Random-seeding and uniform designs. An observational design
would occur if the experimenter did nothing and let the observations occur
with random ability levels. For example, practitioners sometimes assume
that the naturally occurring distribution of a unidimensional ability is
normal. The normal distribution associated with the three-parameter
logistic model has a mean of zero and a standard deviation of 1.7. Let $\Phi$
denote the standard normal distribution. For this example $\eta(x) = \Phi(x/1.7)$.
This design occurs naturally if examinees are selected at random to receive
an experimental item. This is known as random seeding of experimental
items. The rectangular or uniform design studied by Wingersky and Lord
(1984) and Stocking (1990) also can be formulated with an $\eta$.

Algorithms for D-optimal designs. Using the two-parameter logistic
model, Ford (1976) obtained approximate D-optimal designs that have
closed formulas; see Appendix A. In general, no closed-form solutions to
model (18) exist. We will use Ford's approximate solutions to compare with
other designs. We will employ a search method to solve the problem (18)
for N-point designs when N is small, which we will use to find sequentially
constructed designs. We base our method on those that have been
developed for the linear model, which we discuss now.

In the case of linear statistical models, there exist several heuristic
search techniques (Federov, 1972, p. 167; Welch, 1982; and Haines, 1987).
Federov's algorithm is based on gradient search, but applies only to linear
models and finds approximate, not N-point, optimal designs. Haines'
algorithm is based on simulated annealing, handles both D-optimal and
A-optimal criteria, produces N-point designs, and is simple to code. Welch's
algorithm is based on branch-and-bound search over designs confined to
a finite set of pre-specified points and finds exact N-point D-optimal
designs. Among these three algorithms, branch-and-bound search
requires the least amount of computer time for most practical problems. We
employ the branch-and-bound heuristic to find small-sample designs for
use in the sequential design scheme described in the next section. We
present the branch-and-bound search in Appendix B.

Results  and  Conclusion 

Approximate optimal designs for two-parameter logistic items. Ford
(1976) derived a formula for approximate D-optimal designs for the
two-parameter logistic model. We list the description in Appendix A for
the design space I = [-1, +1], and we will use this design space for the
following results. This design space corresponds to the notion of
concentrating the design to single out individuals with more informative
ability levels than would be the case with a normal distribution
(Stocking, 1990).

The main characteristic of D-optimal designs is that the approximate
optimal design puts one-half its probability at each of two points. This
follows from the duality theory and the shape of the manifold in $R^2$
induced by the design space. To see this, note first that, for the
two-parameter logistic model, M = 2,


(23) 




hence

\[
\mathcal{M} = \left\{ z \in R^2 : z = \{P(x;\beta)[1 - P(x;\beta)]\}^{1/2}\,[1, x]^T,\ x \in I \right\} \tag{24}
\]


The manifold $\mathcal{M}$ is a smooth curve in $R^2$ because the
logistic model is infinitely differentiable. Graphically, it resembles a
smooth "C" with its ends curved back toward the origin. Analytically, its
defining equations are not those of an ellipse. Hence, the optimal design
has a two-point distribution because this subset touches a minimal-area
ellipse in $R^2$, centered at the origin, at two points.

In Table 1, we have listed the a and b parameters of three logistic
curves along with the support points of their approximate D-optimal
designs. In terms of our parametric representation, the conventional
discrimination parameter, a, and difficulty parameter, b, of the
two-parameter logistic model are:

\[
a = \beta_1 \quad \text{and} \quad b = -\frac{\beta_0}{\beta_1}. \tag{25}
\]

Each set of parameters is representative of one of the regions $B_1$,
$B_2$, and $B_3$ (A-1 to A-3). We will characterize these three sets of
item parameters as "low," "medium," and "high" in reference to the value
of the discrimination parameter. The three values of the discrimination
parameter appear to reflect the values seen in practice. Note that, in
some cases, the support points of the optimal design are found not at the
extreme points but at interior points of [-1, +1].

Table 1 about here.


In these two-parameter logistic models, we have restricted the design
space to the interval [-1, +1]. If we widened the interval, the support
points for the "low" and "medium" items would no longer be the end points
of the interval. Thus optimal designs are not merely the extreme points
of the interval. This is the major feature in which optimal designs for
non-linear models differ from optimal designs for linear models. The
reason for this phenomenon is that the manifold $\mathcal{M}$, defined
above, is non-linear in x.

For the three-parameter logistic model, the manifold is a subset of
$R^3$. It does not have an elliptical surface and therefore it will touch
its minimum-content ellipsoid at three points. The values of these three
points and the corresponding design probabilities are complicated and
will not be presented here.

We will not use these theoretical results in our construction of
sequential designs because we desire to obtain designs for response
models other than the two- or three-parameter logistic models. Also, we
seek exact N-point optimal designs, which approximate optimal designs may
not fit well. However, we will use these theoretical results to compare
with some ad hoc designs, such as the normal or rectangular distribution
of ability level.

Relative efficiencies of some random-seeding designs for estimating the
two-parameter logistic response model. The relative efficiency of a
continuous design $\eta$ for a two-parameter logistic model is obtained
from:

\[
M(\eta;\beta) = \int P(x;\beta)[1 - P(x;\beta)] \begin{bmatrix} 1 & x \\ x & x^2 \end{bmatrix} \eta(x)\,dx. \tag{26}
\]

When $\eta$ corresponds to a rectangular or normal density, the integrals
may be readily evaluated with quadrature methods.
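The quadrature step can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes the logit form $\beta_0 + \beta_1 x$ with $\beta_1 = a$ and $\beta_0 = -ab$, uses NumPy's Gauss-Hermite and Gauss-Legendre rules, and the function names and quadrature orders are hypothetical choices. It evaluates only the information matrix (26); the relative efficiency itself depends on definition (22), which is not reproduced here.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from numpy.polynomial.legendre import leggauss

def weight(x, beta):
    """P(x; beta)[1 - P(x; beta)] for the logit beta[0] + beta[1] * x."""
    p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    return p * (1.0 - p)

def info_from_nodes(nodes, probs, beta):
    """Approximate the integral (26) by a weighted sum over quadrature nodes."""
    w = probs * weight(nodes, beta)
    return np.array([[np.sum(w), np.sum(w * nodes)],
                     [np.sum(w * nodes), np.sum(w * nodes ** 2)]])

def info_normal(beta, sd=1.7, order=60):
    """Normal design, mean 0 and standard deviation sd (Gauss-Hermite rule)."""
    t, w = hermegauss(order)          # weight function exp(-t^2 / 2)
    return info_from_nodes(sd * t, w / np.sqrt(2.0 * np.pi), beta)

def info_uniform(beta, order=40):
    """Rectangular design on [-1, +1] (Gauss-Legendre rule)."""
    t, w = leggauss(order)            # weight function 1 on [-1, 1]
    return info_from_nodes(t, w / 2.0, beta)

# "High" item: a = 2.3851, b = -.1745, so beta = (-a*b, a) (assumed mapping).
beta_high = np.array([2.3851 * 0.1745, 2.3851])
M_normal = info_normal(beta_high)
M_uniform = info_uniform(beta_high)
print(np.linalg.det(M_normal), np.linalg.det(M_uniform))
```

For the rectangular design the upper-left entry has a closed form, $[P(1) - P(-1)]/(2\beta_1)$, which gives a quick check on the quadrature.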

We have determined the relative efficiencies for these two designs for
the three logistic models discussed previously (Table 2). Although the
normal design performs better than the uniform design, we note that both
the normal and uniform designs perform poorly. Indeed, the best relative
efficiency that the normal design attains is only 23.56%, for the item
parameters (2.3851, -.1745). In estimating these item parameters, this
means roughly that for 100 observations from the normal design, one may
obtain the same amount of accuracy with 12 observations at -0.82 and 12
observations at 0.47.

Table  2  about  here. 

That the normal design performs better than the rectangular design
raises a question about the consistency of these results with those of
Wingersky and Lord (1984) and Stocking (1990). An explanation is that we
are comparing designs according to the criterion of D-optimality, whereas
they compare designs by a criterion directly related to A-optimality.
Rectangular designs are better than normal designs according to the
criterion of A-optimality and worse according to the criterion of
D-optimality. This result is not inconsistent with the criterion of
strong optimality because both the D-optimality and A-optimality criteria
are necessary, but not sufficient, for a design to be strongly optimal.
We have not investigated the criterion of strong optimality for
rectangular and normal designs.

Sequential  Design  Theory  and  Monte  Carlo  Study 

Methods

Sequential designs. Because an optimal design depends on the unknown
item parameters, it is impossible to employ the optimal design in
practice. To overcome this drawback, we sequentially construct reasonable
designs. The construction procedure collects the total sample in small
subdesigns of size n that are n-point D-optimal for the current estimates
of the item parameters. All this takes place within the environment of
on-line tests, which provide estimates of ability levels. The estimates
of the item parameters gradually improve as the overall sample size, N,
increases. To ensure this improvement, the estimate accounts for error in
the estimated ability levels using the modified MLE (13) and (15).

We consider a simple framework for sequentially constructing a design.
We obtain a sequential design by repeatedly cycling through three steps
(Figure 1). Step one obtains responses to the item from a small number of
examinees whose abilities satisfy the design from step three. Step two
accumulates these data with prior data and obtains estimates of the item
parameters. Step three obtains a small-sample design that is an exact
n-point optimal design for problem (18), where the estimates of step two
substitute for $\beta$. A sequential design results in an empirical
design that will rarely equal an exact n-point D-optimal design but will,
however, approximate the optimal design associated with the unknown item
parameters. We determine how well the sequential designs perform by
computing their relative efficiencies (22).


Figure 1 about here.


Simulation study. We fully describe the sequential algorithm in
Appendix C. One must fix several tuning parameters for each application
of the algorithm. These tuning parameters are: n, the size of the
subdesign; N, the overall sample size or stopping time; $K_{yy}$, the
test reliability (10); and S, the effective sample size (16). We obtained
simulations of designs for calibrating two-parameter logistic items. We
varied the tuning parameters as follows: n = 3, 5, 15; N = 200, 400;
$K_{yy}$ = 0.75, 0.80, 0.85, 0.90, 0.95, 1.00; S = 1, 15, 30, ∞. For
brevity, we did not duplicate some of the reliability ratios between
N = 200 and 400.

We fixed the design space to be twenty points unequally spaced along the
interval [-1, +1] (Table C-1). These twenty points correspond to ten
pairs of support points for ten approximate D-optimal designs. The items
for which these ten designs are optimal are listed in Table C-1 under the
column heading "Associated Items." Assuming that experimental items have
difficulties between -1 and +1, this design space enables the algorithm
to expose items to examinees with ability levels no farther away than two
units from the difficulty parameter. We chose these ten items to be
representative of the range of discrimination parameters and the three
regions $B_1$, $B_2$, and $B_3$ defined in equations (A-1), (A-2), and
(A-3). By symmetry arguments, the simulation results apply to items with
difficulty parameters obtained from the difficulty parameters of the
items in Table 1 reflected around zero. The design space would also be
obtained by reflecting the design space in Table C-1 around zero.


Results and Conclusion

The relative efficiencies of sequentially constructed designs are
presented in Tables 3 and 4. The overall impression conveyed by the
tables is that sequentially constructed designs are reasonable designs
for item calibration. Indeed, the lowest efficiency is 0.77 for N ≈ 200,
and 0.72 for N ≈ 400.

Performance of modified MLE's. Generally speaking, the regular MLE was
superior to the modified one. The exceptions were for the large
discrimination parameter in two cases: a) with n = 15, N = 180, in
Table 3; and b) with n = 3, N = 402, in Table 4.

Large reliability ratios tend to degrade the performance of the MLE: the
more error in the estimated ability level, the worse the MLE performs.
Even when there is no error in the ability level, the modified MLE does
worse than the unmodified MLE for n = 15 and the item with the large
discrimination parameter (Table 3). This is because the performance of
any design that is not optimal is poor, and the instability of the
modified MLE exacerbates this problem, especially when any one subsample
represents a high proportion of the overall design.

Effects of sample size. It appears that n = 5 is superior to n = 3 in
Tables 3 and 4. However, too large a subsample size is bad, as seen with
n = 15 in Table 3. Because our simulation ran too long, we do not report
results for n = 15 and N = 400. As expected, the greater the overall
sample size, the higher the relative efficiency attained with the
sequentially constructed designs.

It was thought that the modified MLE would compensate better than it did
for the error in the ability level for the effective sample sizes, S, and
overall sample sizes, N, that we studied. However, asymptotic theory
suggests that the modification becomes worthwhile at a larger sample size
than N = 400. Another possible reason for these poor results is that the
effective sample sizes studied were too small. In Table 4, we see that
the modified and regular maximum likelihood estimators are operatively
equal for S = 30. We do not present the results here, but we have studied
S = 40, 50, ..., 100. We found that the modified MLE does only slightly
better than the regular MLE.

Laying the performance of the modification aside, these results compare
extremely well with random seeding of items. We saw that the best that
random seeding did was about 24% efficiency (Table 2), whereas the worst
that sequential designs did is about 95% efficiency with the best
configuration: no modification, n = 5, and N = 400 (Table 4).

In conclusion, these results suggest that one should implement
sequentially constructed designs utilizing the regular MLE, a subsample
size of 5, and an overall sample size of N = 400. This configuration
performs well even for items with large discrimination parameters and
on-line tests with low reliability.

Discussion 

Modified MLE. Let us assume that we have a consistent estimator of
$\beta$. The relative efficiency of a sequential design approaches unity
as the total sample size increases; in other words, a sequential design
is asymptotically optimal (Wu, 1985). However, the rate at which the
relative efficiency approaches one is important. As we have seen, an
unstable estimator results in inefficient designs for moderate sample
sizes. The theory on this topic falls in the general area of second-order
efficiency; however, the theory is incomplete for sequential designs.

Designs for simultaneous item estimation. Practical constraints may
dictate that designs for several items be constructed where items must
"compete" with each other for a limited number of examinees with optimal
ability levels. In addition, frugal utilization of each calibration
session requires that all the examinees be used. The mathematical
programming model for this situation is as follows. Let
$I = \{x_i : i = 1, \ldots, r\}$ be a finite collection of candidate
ability levels; let $m_i$ denote the number of examinees with ability
$x_i$ available; and let $m_+ = n$. The "plus" notation denotes summation
over the index. Suppose we are to calibrate c items, each with response
function $P_j(x;\beta)$, j = 1, ..., c. A collection of c $n_j$-point
designs for simultaneously calibrating c items is $(n_{ij})$,
non-negative integers, where $n_{+j} = n_j$, j = 1, ..., c, and
$n_{++} = n$. Associate with each $n_j$-point design a probability
distribution on I, $\eta_{n_j}$, which puts probability
$p_{ij} = n_{ij}/n_j$ at $x_i$. Let $M_j(\eta_{n_j};\beta)$ denote the
information matrix associated with item j. A mathematical programming
model is:

\[
\max_{n_{ij}} \sum_{j=1}^{c} \log\det\{M_j(\eta_{n_j};\beta)\} \tag{27}
\]

such that

\[
n_{i+} = m_i, \quad i = 1, \ldots, r; \tag{28}
\]

\[
n_{+j} = n_j, \quad j = 1, \ldots, c; \tag{29}
\]

\[
m_+ = n_{++} = n; \tag{30}
\]

\[
n_{ij} \ge 0, \ \text{integer}. \tag{31}
\]

The criterion (27) is related to, but not equal to, D-optimality; that
is to say, the criterion does not correspond to the joint confidence
ellipsoid of all c sets of item parameters. Another criterion is the
simple summation of all c information matrices; however, this criterion
is not equal to the criterion of D-optimality. Because the "log-det"
function is strictly concave over the space of non-negative definite
matrices, the criterion (27) is a lower bound to the logarithm of the
weighted sum of c information matrices, with weights $n_j/n$.

Problem (27) - (31) may be solved with the branch-and-bound technique.
The values for $m_i$ in constraint (28) are uncontrolled, resulting from
expected flows of examinees at individual testing sites. Constraint (29)
enables the practitioner to control the proportion of the total
observations allocated to any one item. This is an important degree of
freedom, as one may desire to spend fewer observations on items with
imprecise estimates of difficulty relative to the other items. Constraint
(30) is also uncontrolled and determined by the flow of examinees.

On-line item seeding and non-interference with the testing of the
subjects' ability level. To limit the exposure of examinees to
inappropriate items, one may either constrain the design space, add more
constraints to the mathematical programming problem, or reformulate the
information matrix. Our simulations limited the design space to the
interval [-1, +1]. If most experimental items have difficulty in the
range -1 to +1, then this design space makes it unlikely that examinees
are exposed to an item more than two units away from their ability. Other
intervals could also be used.

If a range of item difficulties wider than [-1, +1] is anticipated, then
one could allow the candidate design points to be spread out in the
appropriate interval, and also place constraints on the distance between
the item difficulty and the candidate design points. We propose the
constraint on each design point:
$\{P(x_i;\beta)[1 - P(x_i;\beta)]\}^h \ge v$, where v and h are
non-negative fixed constants. The effect of this constraint is to
eliminate certain candidate design points that are outlying relative to
the item difficulty.

Another approach is to modify the Fisher information matrix so that the
D-optimal design points are not the extreme points in the design space.
An easy modification is:

\[
M(x;\beta,h) = \sum_{i=1}^{n} \sigma^{-2h}(x_i;\beta)\,
\frac{\partial P(x_i;\beta)}{\partial\beta}\,
\frac{\partial P(x_i;\beta)^T}{\partial\beta} \tag{32}
\]

where h is a non-negative constant. The integer programming problems
using this criterion are solvable with the methods proposed here.
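A small numerical sketch may help show the effect of h. It rests on several assumptions that are not stated in the text: $\sigma^2(x;\beta) = P(x;\beta)[1 - P(x;\beta)]$, the two-parameter logistic model with logit $\beta_0 + \beta_1 x$ (so that $\partial P/\partial\beta = \sigma^2 [1, x]^T$), equal-weight two-point designs, and a hypothetical grid of 41 candidate points. With h = 1 the matrix reduces to the usual Fisher information; smaller h down-weights points where the response variance is small.

```python
import itertools
import numpy as np

def sigma2(x, beta):
    """Response variance P(1 - P) under the logit beta[0] + beta[1] * x."""
    p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    return p * (1.0 - p)

def modified_info(points, beta, h):
    """Sum of sigma^(-2h) * (dP/dbeta)(dP/dbeta)^T over the design points."""
    M = np.zeros((2, 2))
    for x in points:
        grad = sigma2(x, beta) * np.array([1.0, x])   # dP/dbeta for the 2PL
        M += sigma2(x, beta) ** (-h) * np.outer(grad, grad)
    return M

def best_pair(grid, beta, h):
    """Grid search for the two-point design maximising det(modified_info)."""
    return max(itertools.combinations(grid, 2),
               key=lambda pair: np.linalg.det(modified_info(pair, beta, h)))

grid = np.round(np.linspace(-1.0, 1.0, 41), 2)
# "Low" item: a = 1.2030, b = -.3459, so beta = (-a*b, a) (assumed mapping).
beta_low = np.array([1.2030 * 0.3459, 1.2030])
pair_fisher = best_pair(grid, beta_low, h=1.0)    # ordinary Fisher information
pair_modified = best_pair(grid, beta_low, h=0.0)  # heavier down-weighting
print(pair_fisher, pair_modified)
```

For this item the ordinary criterion selects both end points of the interval, while h = 0 pulls the right-hand support point into the interior.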
Appendix A


Suppose we let the design space I be [-1, +1] and it is known that
$\beta_0 \ge 0$. Ford (1976) has shown that for the two-parameter
logistic function the approximate optimal design puts one-half the mass
at one point and one-half at another point. These two points depend on
the value of the item parameter vector, $\beta$, in the following way.

Let c be the positive solution of the equation $e^c = (c + 1)/(c - 1)$;
c ≈ 1.5434.

Also let

\[
B_1 = \{\beta : \beta_0 > 0,\ \beta_1 > 0,\ \beta_1 - \beta_0 \ge c\} \tag{A-1}
\]

\[
B_2 = \left\{\beta : \beta_0 > 0,\ \beta_1 > 0,\ \beta_1 - \beta_0 < c,\ \exp(\beta_0 + \beta_1) > \frac{\beta_1 + 1}{\beta_1 - 1}\right\} \tag{A-2}
\]

\[
B_3 = \left\{\beta : \beta_0 > 0,\ \beta_1 > 0,\ \beta_1 - \beta_0 < c,\ \exp(\beta_0 + \beta_1) \le \frac{\beta_1 + 1}{\beta_1 - 1}\right\} \tag{A-3}
\]

Then

(i) if $\beta \in B_1$, the support points are
$\frac{c - \beta_0}{\beta_1}$ and $\frac{-c - \beta_0}{\beta_1}$; (A-4)

(ii) if $\beta \in B_2$, they are -1 and $x^*$, where $x^*$ is the
solution of
$\exp(\beta_0 + \beta_1 x) = \frac{2 + (x + 1)\beta_1}{-2 + (x + 1)\beta_1}$; (A-5)

(iii) if $\beta \in B_3$, they are -1 and +1. (A-6)


This rather complicated design can be roughly summarized by saying the
steeper the response curve is, the more the support points are squeezed
to the center of the design space; the flatter, and hence more linear,
the response curve is, the more the support points are pushed to the
extremes of the design space.
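As a worked illustration of case (i), the following sketch computes the constant c by bisection and the two interior support points. It assumes the logit form $\beta_0 + \beta_1 x$ with $\beta_1 = a$ and $\beta_0 = -ab$; the function names and the bisection bracket are hypothetical choices.

```python
import math

def ford_constant(lo=1.1, hi=3.0, tol=1e-12):
    """Bisection for the positive root of exp(c) = (c + 1) / (c - 1)."""
    f = lambda c: math.exp(c) - (c + 1.0) / (c - 1.0)   # increasing for c > 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def support_points_interior(beta0, beta1):
    """Support points when both optimal points fall inside [-1, +1]."""
    c = ford_constant()
    if beta0 <= 0.0 or beta1 <= 0.0 or beta1 - beta0 < c:
        raise ValueError("beta is outside the interior-support region")
    return (-c - beta0) / beta1, (c - beta0) / beta1

# "High" item from the text: a = 2.3851, b = -.1745.
a, b = 2.3851, -0.1745
c = ford_constant()
left, right = support_points_interior(-a * b, a)
print(round(c, 4), round(left, 2), round(right, 2))
```

The rounded support points agree with the values -0.82 and 0.47 quoted for this item in the discussion of Table 2.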
Appendix B


Let $x_i$, i = 1, 2, ..., r, denote candidate design points. The N-point
D-optimal design problem is to place N observations at the r design
points so as to maximize $\det\{M(\eta_N;\beta)\}$. There are
$\binom{N + r - 1}{N}$ possible designs, some leading to a zero
determinant. Instead of performing an exhaustive search over all possible
designs, it is possible to partition the set of designs and to perform
searches over a much smaller set of designs. Let
$l = (l_1, l_2, \ldots, l_r)$ and $u = (u_1, u_2, \ldots, u_r)$ be
collections of non-negative integers less than or equal to N. The maximum
determinant exceeds or equals the solution to:

\[
\text{maximize } \det\{M(\eta_N;\beta)\}
\quad \text{such that } \sum_{i=1}^{r} n_i = N, \quad
l_i \le n_i \le u_i, \ i = 1, 2, \ldots, r. \tag{B-1}
\]

We call this maximization a node. The original maximization is called
the root node when all $l_i = 0$ and $u_i = N$. The collection (B-1) of
designs is further subdivided into two nonempty partitions, as follows:

\[
n_j = l_j, \quad l_i \le n_i \le u_i, \ i \ne j \tag{B-2}
\]

\[
l_j + 1 \le n_j \le u_j, \quad l_i \le n_i \le u_i, \ i \ne j \tag{B-3}
\]

where the way j will be chosen is described shortly. Thus we create two
nodes by replacing (B-1) with either (B-2) or (B-3).

Thus every node has either zero or two branches leading from it,
creating a binary tree with the $\binom{N + r - 1}{N}$ N-point designs
located at the extreme nodes. We guide the search for the optimal design
by going up the tree along branches that are not suboptimal.

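We avoid suboptimal branches by calculating a bound on
$\det\{M(\eta_N;\beta)\}$ over all designs leading from a common node
(l, u) on the branch as follows. Define

\[
N' = \sum_{i=1}^{r} l_i \tag{B-4}
\]

\[
M(l,\epsilon;\beta) = \sum_{i=1}^{r} (l_i + \epsilon)\,\sigma^{-2}(x_i;\beta)\,
\frac{\partial P(x_i;\beta)}{\partial\beta}\,
\frac{\partial P(x_i;\beta)^T}{\partial\beta} \tag{B-5}
\]

\[
d(x_i,l,\epsilon;\beta) = \sigma^{-2}(x_i;\beta)\,
\frac{\partial P(x_i;\beta)^T}{\partial\beta}\,
M^{-1}(l,\epsilon;\beta)\,
\frac{\partial P(x_i;\beta)}{\partial\beta} \tag{B-6}
\]

where $\epsilon$ is a small positive number to ensure that
$M(l,\epsilon;\beta)$ is positive definite.

The determinant of the information matrix, where the design satisfies
the constraints (B-1), satisfies the following bound (Welch, 1982):

\[
\det\{M(\eta_N;\beta)\} \le \det\{M(l,\epsilon;\beta)\}\,\{1 + d(l,\epsilon;\beta)\}^{N - N'} \tag{B-7}
\]

where

\[
d(l,\epsilon;\beta) = \max_{i = 1, \ldots, r;\ l_i < u_i} d(x_i,l,\epsilon;\beta). \tag{B-8}
\]

The value for j that is used to branch at the node is the value of i for
which the maximum $d(x_i,l,\epsilon;\beta)$ is attained.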
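The tree structure defined by (B-2) and (B-3) can be sketched without the pruning bound. The code below is an illustration only: it splits every node on the first index with $l_j < u_j$ rather than on the index maximizing d, so it visits every extreme node and is therefore an exhaustive search, not the branch-and-bound search itself. The five candidate points, the item parameters, and the function names are hypothetical choices; the logit is assumed to be $\beta_0 + \beta_1 x$.

```python
from math import comb
import numpy as np

def leaves(l, u, N):
    """Yield every integer vector n with l <= n <= u and sum(n) = N,
    reached by repeatedly splitting a node as in (B-2) and (B-3)."""
    if sum(l) > N or sum(u) < N:
        return                                  # node contains no design
    free = [i for i in range(len(l)) if l[i] < u[i]]
    if not free:                                # extreme node: l == u
        yield tuple(l)
        return
    j = free[0]                                 # simplified branching rule
    u_fixed = list(u)
    u_fixed[j] = l[j]                           # (B-2): n_j = l_j
    yield from leaves(l, u_fixed, N)
    l_raised = list(l)
    l_raised[j] += 1                            # (B-3): n_j >= l_j + 1
    yield from leaves(l_raised, u, N)

def det_info(counts, points, beta):
    """Determinant of sum_i n_i P(1 - P) [1, x_i]^T [1, x_i]."""
    x = np.asarray(points, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    w = np.asarray(counts) * p * (1.0 - p)
    F = np.column_stack([np.ones_like(x), x])
    return np.linalg.det((F * w[:, None]).T @ F)

points = [-1.0, -0.8216, 0.0, 0.4726, 1.0]      # hypothetical candidates
beta = (2.3851 * 0.1745, 2.3851)                # "high" item, assumed mapping
N = 4
designs = list(leaves([0] * len(points), [N] * len(points), N))
best = max(designs, key=lambda n: det_info(n, points, beta))
print(len(designs), comb(N + len(points) - 1, N), best)
```

The number of extreme nodes equals the binomial count of N-point designs, which is what makes pruning with the bound (B-7) worthwhile for larger N and r.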
Appendix C

The Sequential Design Algorithm

Step 0

a) Choose S, the effective sample size for switching from the regular
MLE to the modified estimator.

b) Choose $N_{\max}$, the maximum number of observations to be gathered
via the sequential design algorithm.

c) Choose $K_{yy}$, the test reliability ratio.

d) If this is a simulation, choose the item response model and its
parameter vector $\beta$.

e) Choose n, the sample size for the subdesign. Set N = n initially.

f) Choose $x_i$, i = 1, ..., n, an initial design.

Step 1

a) Pool the design points $x_i$, i = 1, ..., n, with all prior design
points.

b) Determine $\sigma$ for the set of pooled design points.

d) If this is a simulation, randomly generate $X_i = x_i + e_i$, where
$e_i$ follows $N(0, \sigma^2)$, i = 1, ..., n. If this is part of a
real-time system, find n examinees with estimated ability level $x_i$ and
measurement error $\sigma$, i = 1, ..., n.

e) If this is a simulation, obtain n random responses $U_i$ at latent
ability $X_i$, according to the true item response model
$\Pr\{U_i = 1 \mid X_i, \beta\}$. If this is part of a real-time system,
obtain a response to the item from each of the chosen n examinees.

f) Pool $U_i$, $x_i$ with all prior data, i = 1, ..., n.

Step 2

a) Find P for the current design, $P = N^{-1} \sum_{\text{all data}} u_i$.

b) Find the effective sample size, $N' = NP(1 - P)$.

c) If N' > S, obtain modified maximum likelihood estimates of the item
parameters. Otherwise obtain regular maximum likelihood estimates of the
item parameters.

Step 3

a) N = N + n.

b) If $N \ge N_{\max}$, stop. Otherwise continue.

c) Based on the current item parameter estimates, find
$x_1, \ldots, x_n$, the exact n-point optimal design using
branch-and-bound and the criterion of D-optimality.

d) Go to step (1.a).
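The cycle of Steps 1 to 3 can be sketched in a simplified simulation. The sketch departs from the algorithm above in ways that should be read as assumptions: ability is measured without error (reliability 1.00), only the regular MLE is used, a small ridge penalty keeps the Newton iterations finite when the early data are separable, the n-point design is found by exhaustive search rather than branch-and-bound, and the loop simply stops once `n_max` observations have been collected. All function names, the starting estimate, and the seed are hypothetical.

```python
import itertools
import numpy as np

CANDIDATES = [-1.000, -.9325, -.8735, -.8216, -.7755, -.7343, -.6972,
              -.6637, -.6333, -.6055, .4461, .4726, .5025, .5364, .5753,
              .6371, .7315, .8529, .9925, 1.0000]   # points of Table C-1

def prob(x, beta):
    """Two-parameter logistic response with logit beta[0] + beta[1] * x."""
    return 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))

def info_matrix(points, beta):
    """Unnormalised information: sum of P(1 - P) [1, x]^T [1, x]."""
    x = np.asarray(points, dtype=float)
    p = prob(x, beta)
    F = np.column_stack([np.ones_like(x), x])
    return (F * (p * (1.0 - p))[:, None]).T @ F

def best_subdesign(candidates, n, beta):
    """Exhaustive stand-in for Step 3(c): n-point design maximising det M."""
    return max(itertools.combinations_with_replacement(candidates, n),
               key=lambda d: np.linalg.det(info_matrix(d, beta)))

def fit_mle(x, u, start, ridge=1e-2, iters=50):
    """Newton-Raphson for the logistic likelihood with a small ridge term."""
    beta = np.array(start, dtype=float)
    F = np.column_stack([np.ones_like(x), x])
    for _ in range(iters):
        grad = F.T @ (u - prob(x, beta)) - ridge * beta
        hess = info_matrix(x, beta) + ridge * np.eye(2)
        beta = beta + np.clip(np.linalg.solve(hess, grad), -1.0, 1.0)
    return beta

def sequential_design(true_beta, candidates, n=3, n_max=60, seed=0):
    rng = np.random.default_rng(seed)
    beta_hat = np.array([0.0, 1.0])               # hypothetical starting value
    design = rng.choice(candidates, size=n)       # Step 0(f): initial design
    xs, us = [], []
    while len(xs) < n_max:
        x = np.asarray(design, dtype=float)
        u = (rng.random(n) < prob(x, true_beta)).astype(float)   # Step 1(e)
        xs.extend(x)
        us.extend(u)                                             # Step 1(f)
        beta_hat = fit_mle(np.array(xs), np.array(us), beta_hat) # Step 2
        design = best_subdesign(candidates, n, beta_hat)         # Step 3(c)
    return np.array(xs), beta_hat

true_beta = np.array([2.3851 * 0.1745, 2.3851])   # "large" item, assumed mapping
xs, beta_hat = sequential_design(true_beta, CANDIDATES)
print(len(xs), beta_hat)
```

Exhaustive search is practical here only because n and the candidate set are small; the branch-and-bound search of Appendix B is the method the algorithm calls for.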
Table 1

Optimal Designs on [-1, +1] for Two-Parameter Logistic Items


                    Two-Parameter Logistic Response Function

                    Low               Medium            High

a, b                1.2030, -.3459    1.7326, -.2402    2.3851, -.1745

x                   -1      1         -1      .73       -.82    .47

$\eta(x)$           .5      .5        .5      .5        .5      .5

$\mu_x$             0                 -.135             -.175

$\sigma_{xx}$       1                 .748              .416
Table 2

Relative Efficiencies for a Normal N(0, 1.7) and a Rectangular U(-1, +1)
Design in Estimating Two-Parameter Logistic Items


               Two-Parameter Logistic Response Function (a, b)

               Low               Medium            High

Design         1.2030, -.3459    1.7326, -.2402    2.3851, -.1745

N(0, 1.7)      .2028             .2252             .2356

U(-1, +1)      .0859             .1194             .1544
Table 3

Relative Efficiencies of Sequential Designs, N ≈ 200


                    Magnitude of Discrimination Parameter

                 Small            Medium           Large

                         Effective Sample Size§

Reliability      1       ∞        1       ∞        1       ∞

n = 3, N = 198

  .85            .96     .98      .90     .95      .89     .86
  .90            .98     .99      .87     .95      .79     .87
  .95            .98     .99      .93     .96      .80     .88
  1.0            .99     .99      .93     .93      .89     .89

n = 5, N = 200

  .85            .95     .98      .88     .97      .89     .90
  .90            .95     .98      .86     .97      .88     .91
  .95            .96     .98      .96     .95      .90     .91
  1.0            .98     .98      .95     .95      .91     .91

n = 15, N = 180

  .85            .99     1.0      .94     .95      .86     .80
  .90            1.0     1.0      .90     .93      .87     .77
  .95            1.0     1.0      .91     .93      .85     .77
  1.0            1.0     1.0      .93     .93      .85     .77

§ "∞" means the effective sample size was so large that the regular
M.L.E. was used always.
Table 4

Relative Efficiencies of Sequential Designs, N ≈ 400


                      Magnitude of Discrimination Parameter

          Small                  Medium                 Large

                           Effective Sample Size§

Rel       1    15   30   ∞       1    15   30   ∞       1    15   30   ∞

n = 3, N = 402

  .75     .95  .96  .99  .99     .73  .95  .94  .97     .91  .91  .93  .89
  .80     .97  .96  .98  .98     .72  .95  .95  .97     .92  .94  .95  .92
  .85     .98  .95  .98  .98     .85  .96  .96  .97     .93  .93  .94  .91

n = 5, N = 400

  .75     .82  .97  .98  .99     .77  .97  .98  .98     .91  .91  .94  .95
  .80     .92  .99  .99  .99     .84  .97  .97  .98     .89  .89  .94  .95
  .85     .97  .97  .99  .99     .93  .97  .98  .98     .90  .90  .94  .95

§ "∞" means the effective sample size was so large that the regular
M.L.E. was used always.
Table C-1

Candidate Ability Levels


                 Support Points          Associated Items

Pair Number      Left        Right       a          b

  1              -1.000      1.0000      1.2030     -.3459
  2              -.9325      .9925       1.3922     -.2989
  3              -.8735      .8529       1.5056     -.2764
  4              -.8216      .7315       1.6191     -.2570
  5              -.7755      .6371       1.7326     -.2402
  6              -.7343      .5753       1.8461     -.2254
  7              -.6972      .5364       1.9595     -.2124
  8              -.6637      .5025       2.1014     -.1980
  9              -.6333      .4726       2.2432     -.1855
 10              -.6055      .4461       2.3851     -.1745
Figure Caption

Figure 1. Flow of tasks for sequentially constructing a design for
non-linear response models.